Abstract. We investigate the generalization ability of a simple perceptron trained in the off-line and on-line supervised modes. Examples are drawn from a teacher that is a non-monotonic perceptron. For this system, the difficulty of training can be controlled continuously by changing a parameter of the teacher. We train the student with several learning strategies in order to obtain the theoretical lower bounds of the generalization error under various conditions. The asymptotic behavior of the learning curve is derived, which enables us to determine the most suitable learning algorithm for a given value of the parameter controlling the difficulty of training.



1 Introduction 

Learning from examples has been one of the most attractive problems for computational neuroscientists [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. For a given system, the quality of a learning strategy should be measured by the generalization error, namely the probability of disagreement between the teacher and student outputs for a new example after the student has been trained. Much effort has been invested in the case of learnable rules, but it is also desirable to construct suitable learning strategies and to minimize the residual generalization error even if it is impossible for the student to reproduce the teacher input-output relations perfectly. In the present contribution we investigate the generalization error for such an unlearnable case [11, 12, 13, 14, 15, 16, 17].

In our model system, the student is a simple perceptron whose output is given as $S(u) = \mathrm{sign}(u)$ with $u = \sqrt{N}\,(\mathbf{J}\cdot\mathbf{x})/|\mathbf{J}|$, where $\mathbf{J}$ is the synaptic weight vector and $\mathbf{x}$ is a random input vector drawn from the $N$-dimensional sphere $|\mathbf{x}|^2 = 1$. The teacher is a non-monotonic (or reversed-wedge type) perceptron whose output is $T_a(v) = \mathrm{sign}[v(a - v)(a + v)]$ with $v = \sqrt{N}\,(\mathbf{J}^0\cdot\mathbf{x})$, where $\mathbf{J}^0$ denotes the weight vector of the teacher. If $a = 0$ or $a = \infty$, the student can learn the teacher rule perfectly (the learnable case).

If the width $a$ of the reversed wedge is finite, the student cannot reproduce the teacher input-output relations perfectly, and the generalization error remains non-vanishing even after an infinite number of examples have been presented. For this system, when the overlap between the teacher and student is written as $R = (\mathbf{J}\cdot\mathbf{J}^0)/(|\mathbf{J}||\mathbf{J}^0|)$, the generalization error $\epsilon_g$ is

$$\epsilon_g = \langle\langle \Theta(-T_a(v)S(u)) \rangle\rangle \equiv E(R) = 2\int_a^{\infty} Dv\, H\!\left(-\frac{Rv}{\sqrt{1-R^2}}\right) + 2\int_0^{a} Dv\, H\!\left(\frac{Rv}{\sqrt{1-R^2}}\right), \qquad (1)$$

where $H(x) = \int_x^{\infty} Dt$ with $Dt = dt\,\exp(-t^2/2)/\sqrt{2\pi}$, and $\langle\langle\cdots\rangle\rangle$ stands for the average over the connected (joint) Gaussian distribution

$$P_R(u,v) = \frac{1}{2\pi\sqrt{1-R^2}} \exp\!\left(-\frac{u^2+v^2-2Ruv}{2(1-R^2)}\right). \qquad (2)$$

It is important that this expression is independent of the specific learning algorithm. In Fig. 1 we plot $E(R)$ for several values of $a$.
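As a numerical illustration, $E(R)$ can be evaluated directly from the definitions of $S(u)$, $T_a(v)$ and the Gaussian tail $H(x)$. The following is a minimal sketch, assuming the form $E(R) = 2\int_a^\infty Dv\,H(-Rv/\sqrt{1-R^2}) + 2\int_0^a Dv\,H(Rv/\sqrt{1-R^2})$, which follows from conditioning $u$ on $v$; the function names, the cutoff `v_max` and the Simpson grid are illustrative choices, not part of the original analysis.

```python
import math

def H(x):
    """Gaussian tail probability H(x) = int_x^inf Dt."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def gauss(v):
    """Standard normal density, the weight in the measure Dv."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def simpson(f, lo, hi, n=2000):
    """Composite Simpson rule with n (even) sub-intervals."""
    if hi <= lo:
        return 0.0
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(lo + i * h)
    return total * h / 3.0

def gen_error(R, a, v_max=10.0):
    """E(R) for wedge width a >= 0 and overlap -1 < R < 1.

    v_max is a hypothetical cutoff replacing the upper limit infinity.
    """
    k = R / math.sqrt(1.0 - R * R)
    a_cut = min(a, v_max)  # for a = inf the outer region [a, inf) is empty
    inner = simpson(lambda v: gauss(v) * H(k * v), 0.0, a_cut)
    outer = simpson(lambda v: gauss(v) * H(-k * v), a_cut, v_max)
    return 2.0 * (inner + outer)

if __name__ == "__main__":
    # Tabulate E(R) for the wedge widths used in Fig. 1.
    for a in (math.inf, 2.0, 1.0, 0.5, 0.0):
        print(a, [round(gen_error(R, a), 4) for R in (-0.9, 0.0, 0.9)])
```

In the two learnable limits the sketch reduces to the familiar expressions $\arccos(R)/\pi$ for $a = \infty$ and $1 - \arccos(R)/\pi$ for $a = 0$, and for $R \to 1$ at finite $a$ it approaches $2H(a)$.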




Fig. 1. Generalization error as a function of $R$ for $a = \infty$, 2, 1, 0.5 and $a = 0$.



Minimization of $E(R)$ with respect to $R$ gives the theoretical lower bound of the generalization error. In Fig. 2 we show the theoretical lower bound corresponding to the minimum value of $E(R)$ in Fig. 1, and in Fig. 3 we plot the corresponding optimal overlap $R_{\mathrm{opt}}$ which gives the bound.




Fig. 2. The best possible value (theoretical lower bound) of the generalization error and the residual generalization errors of the conventional Hebbian, perceptron and AdaTron learning algorithms, plotted as functions of $a$.



From Fig. 3 we see that one should train the student so that $R$ becomes 1 for $a > a_{c2} = 0.80$. For $a < a_{c2} = 0.80$, the optimal $R$ is not 1 but $R_* = -\sqrt{(2\log 2 - a^2)/(2\log 2)}$. This system shows a first order phase transition at $a = a_{c2}$, where the optimal overlap changes from 1 to $R_*$ discontinuously.
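The location of this transition can be checked numerically by comparing the two candidate minima, $E(1) = 2H(a)$ and $E(R_*)$. The sketch below is illustrative only: it assumes the same integral form of $E(R)$ as derived from the definitions of $S(u)$ and $T_a(v)$, and the helper names, the bisection bracket and the tolerance are hypothetical choices.

```python
import math

def H(x):
    """Gaussian tail probability H(x) = int_x^inf Dt."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def gauss(v):
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def simpson(f, lo, hi, n=1000):
    """Composite Simpson rule with n (even) sub-intervals."""
    if hi <= lo:
        return 0.0
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(lo + i * h)
    return total * h / 3.0

def gen_error(R, a, v_max=10.0):
    """E(R) for -1 < R < 1 (assumed integral form, cutoff v_max)."""
    k = R / math.sqrt(1.0 - R * R)
    a_cut = min(a, v_max)
    inner = simpson(lambda v: gauss(v) * H(k * v), 0.0, a_cut)
    outer = simpson(lambda v: gauss(v) * H(-k * v), a_cut, v_max)
    return 2.0 * (inner + outer)

def r_star(a):
    """Negative stationary overlap R_*, defined for a^2 < 2 log 2."""
    two_log2 = 2.0 * math.log(2.0)
    return -math.sqrt((two_log2 - a * a) / two_log2)

def gap(a):
    """E(R_*) - E(1); negative when R_* beats R = 1."""
    return gen_error(r_star(a), a) - 2.0 * H(a)

def find_a_c2(lo=0.6, hi=1.0, tol=1e-4):
    """Bisection for the width at which the two minima exchange."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    print("estimated a_c2:", find_a_c2())
```

The crossing found this way should lie close to the quoted value $a_{c2} = 0.80$, and for smaller $a$ the overlap $R_*$ is a local minimum of $E(R)$.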

In the following sections, we investigate various learning strategies to clarify 
the asymptotic behavior of learning curves. 



2 Off-line learning 



We first investigate the generalization ability of the student in off-line (or batch) mode following the minimum error algorithm. The minimum error algorithm is a natural learning strategy that minimizes the total error for $P$ sets of examples $\{\xi^P\}$,

$$E(\mathbf{J}|\{\xi^P\}) = \sum_{\mu=1}^{P} \Theta(-T_a^\mu \cdot u^\mu), \qquad (3)$$



Fig. 3. The optimal overlap $R$ which gives the best possible value, and the overlaps which give the residual errors in Fig. 2 for the Hebbian, perceptron and AdaTron learning algorithms, plotted as functions of $a$ (curves: optimal value, perceptron, Hebbian, AdaTron; $a_{c2}$ and $a_{c1}$ are marked on the $a$ axis).



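where we set $u^\mu = (\mathbf{J}\cdot\mathbf{x}^\mu)/\sqrt{N}$. From the energy defined by Eq. (3), the partition function with the inverse temperature $\beta$ is given by

$$Z(\beta) = \int d\mathbf{J}\, \delta(|\mathbf{J}|^2 - N) \exp(-\beta E(\mathbf{J}|\{\xi^P\})) = \int d\mathbf{J}\, \delta(|\mathbf{J}|^2 - N) \prod_{\mu=1}^{P} \left[ e^{-\beta} + (1 - e^{-\beta})\,\Theta(T_a^\mu \cdot u^\mu) \right]. \qquad (4)$$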
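The factorized form of the Boltzmann weight follows because each example contributes either $e^{-\beta}$ (misclassified) or 1 (correctly classified). A minimal sketch of this identity is given below; the function names and the sample values are hypothetical, and only the signs of the products $T_a^\mu u^\mu$ matter.

```python
import math

def theta(x):
    """Step function: 1 for x > 0, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def energy(ts, us):
    """Number of misclassified examples, sum over mu of Theta(-T^mu u^mu)."""
    return sum(theta(-t * u) for t, u in zip(ts, us))

def boltzmann_product(ts, us, beta):
    """Product over examples of [e^{-beta} + (1 - e^{-beta}) Theta(T^mu u^mu)]."""
    weight = 1.0
    for t, u in zip(ts, us):
        weight *= math.exp(-beta) + (1.0 - math.exp(-beta)) * theta(t * u)
    return weight

if __name__ == "__main__":
    # Hypothetical teacher outputs and student local fields for four examples.
    ts = [1, -1, 1, -1]
    us = [0.3, 0.2, -1.1, -0.7]
    beta = 1.5
    print(energy(ts, us), boltzmann_product(ts, us, beta),
          math.exp(-beta * energy(ts, us)))
```

In the limit $\beta \to \infty$ the product vanishes unless every example is classified correctly, which is why $Z(\infty)$ measures the volume of error-free weight vectors.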

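There exist weight vectors that reproduce the input-output relations completely if $\alpha = P/N$ is smaller than a critical capacity $\alpha_c$. Therefore, we can calculate the learning curve (LC) below $\alpha_c$ by evaluating the logarithm of the Gardner-Derrida volume $V_{GD} = Z(\infty)$ as

$$\frac{\log V_{GD}}{N} = \frac{\langle\langle \log Z(\infty) \rangle\rangle_{\{\xi^P\}}}{N} = \frac{1}{N} \lim_{n\to 0} \frac{\langle\langle Z^n(\infty) \rangle\rangle_{\{\xi^P\}} - 1}{n}. \qquad (5)$$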

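On the other hand, at $\alpha = \alpha_c$, $V_{GD}$ shrinks to zero, and for $\alpha > \alpha_c$ no solution can be found in the weight space. We then treat the free energy, Eq. (6), to find the solution weight $\mathbf{J}$ which gives the minimum error for $\alpha > \alpha_c$. Introducing the order parameters $R_\alpha = (\mathbf{J}^0\cdot\mathbf{J}^\alpha)/N$ and $q_{\alpha\beta} = (\mathbf{J}^\alpha\cdot\mathbf{J}^\beta)/N$ (where the subscripts $\alpha$, $\beta$ are replica indices) and using the replica symmetric approximation $R_\alpha = R$ and $q_{\alpha\beta} = q$, Eq. (5) is evaluated as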



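$$\mathrm{ext}_{\{R,q\}} \left\{ 2\alpha \int Dt\, \Omega(R:t) \log \Xi(q:t) + \frac{1}{2}\log(1-q) + \cdots \right\} \qquad (7)$$

with

$$\Omega(R:t) = \int Dz \left[ \Theta(-z\sqrt{1-R^2} - Rt - a) + \Theta(z\sqrt{1-R^2} + Rt) - \Theta(z\sqrt{1-R^2} + Rt - a) \right], \qquad (8)$$

$$\Xi(q:t) = \int Dz\, \Theta(z\sqrt{1-q} + t\sqrt{q}). \qquad (9)$$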
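The step-function integrals over $z$ in Eqs. (8) and (9) can be carried out explicitly in terms of the Gaussian tail $H(x) = \int_x^\infty Dt$. The sketch below is an illustration rather than part of the original calculation: the closed forms are obtained by integrating each $\Theta$ term separately, and the function names and integration grid are hypothetical.

```python
import math

def H(x):
    """Gaussian tail probability H(x) = int_x^inf Dt."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def omega(R, t, a):
    """Closed form of Eq. (8): probability that the teacher output is +1
    when its local field is v = z sqrt(1 - R^2) + R t."""
    s = math.sqrt(1.0 - R * R)
    return H((a + R * t) / s) + H(-R * t / s) - H((a - R * t) / s)

def xi(q, t):
    """Closed form of Eq. (9): probability that z sqrt(1 - q) + t sqrt(q) > 0."""
    return H(-t * math.sqrt(q) / math.sqrt(1.0 - q))

def gaussian_average(f, t_max=10.0, n=4000):
    """Simpson estimate of int Dt f(t) on [-t_max, t_max] (n even)."""
    h = 2.0 * t_max / n
    total = 0.0
    for i in range(n + 1):
        t = -t_max + i * h
        w = 1.0 if i in (0, n) else (4.0 if i % 2 else 2.0)
        total += w * f(t) * math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
    return total * h / 3.0

if __name__ == "__main__":
    # By the symmetry T_a(-v) = -T_a(v), the teacher output is +1 half the time.
    print(gaussian_average(lambda t: omega(0.6, t, 1.0)))
```

For $a = \infty$ the first and third terms of `omega` vanish and the expression reduces to $H(-Rt/\sqrt{1-R^2})$, the weight that appears for a monotonic teacher.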
And Eq. (6) is evaluated as

$$\mathrm{ext}_{\{R,x\}} \left\{ -2\alpha \int Dt\, \Omega(R:t) \left[ \cdots\,\Theta(-t-\sqrt{2x}) + \cdots\,\Theta(t+\sqrt{2x}) \right] + \cdots \right\} \qquad (10)$$

where we have set $x = \beta(1-q)$ to find a non-trivial solution in the limit of $\beta\to\infty$ and $q\to 1$. By solving the saddle point equations from Eqs. (7) and (10), we found that the LC is classified into the following five types depending on the parameter $a$.

• $a = 0, \infty$ (learnable case)

The solutions of the saddle point equation are thermodynamically stable and the LC behaves asymptotically as [18, 19]

$$\epsilon_g \simeq 0.624\,\alpha^{-1}. \qquad (11)$$

• $a > a_{c0} \simeq 1.53$

The order parameter $R$ monotonically increases to 1 as $\alpha\to\infty$. The LC behaves asymptotically as

^min 01 . (1^) 

• $a_{c0} > a > a_{c1}$

A first order phase transition from the poor generalization phase to the good generalization phase is observed at $\alpha \sim O(1)$ in this parameter region (see Fig. 4). In the limit $\alpha\to\infty$, $R$ approaches 1, which achieves the global minimum of the generalization error in this parameter region, and the asymptotic LC is identical to Eq. (12).

• $a_{c1} > a > a_{c2}$

A first order phase transition is observed as in the previous parameter region of $a$ (see Fig. 5). However, the spinodal point $\alpha_{sp}$ goes to infinity. The asymptotic form of the LC for this parameter region of $a$ is the same as Eq. (12).



• $a_{c2} > a > 0$

In this parameter region $E(R)$ is minimized not at $R = 1$ but at $R = R_*$. Therefore, the solution $(R, x) = (R_*, 0)$ is the global minimum of the free energy for all values of $\alpha$ and there is no phase transition. The LC decays to its minimum as

This result implies that the non-monotonic teacher with small $a$ is more difficult for a simple perceptron to learn than that with large $a$ [15]. We conclude that the minimum error algorithm can lead to the best possible value of the generalization error (see Fig. 2) for all values of $a$. Watkin and Rau [11] also investigated the LC for the same system; however, they investigated only the $O(1)$ range of $\alpha$. In this section, we investigated the LC for all ranges of $\alpha$.




Fig. 4. The learning curve for the case of $a = 1.3$. A first order phase transition appears at $\alpha_{th} \simeq 14.7$. The spinodal point is at $\alpha_{sp} \simeq 24.2$.



3 On-line learning dynamics

3.1 Conventional on-line learning algorithms 

The on-line learning dynamics we investigate in this work is generally written as

$$\mathbf{J}^{m+1} = \mathbf{J}^m + g f(T_a(v), u)\,\mathbf{x}, \qquad (14)$$

Fig. 5. The learning curve for the case of $a = 1.0$. A first order phase transition appears at $\alpha_{th} \simeq 47$, and $\epsilon_g$ changes discontinuously from branch 1 to branch 2. The spinodal point $\alpha_{sp}$ has gone to infinity.



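where $m$ is the number of presented patterns and $g$ is the learning rate. In the limit of large $N$, the recursion relation Eq. (14) for the $N$-dimensional vector $\mathbf{J}^m$ is reduced to a set of differential equations for $R$ and $l = |\mathbf{J}|/\sqrt{N}$:

$$\frac{dl}{d\alpha} = \frac{1}{2l} \left\langle\left\langle g^2 f^2(T_a(v), u) + 2 g f(T_a(v), u)\, u\, l \right\rangle\right\rangle, \qquad (15)$$

$$\frac{dR}{d\alpha} = \frac{1}{l^2} \left\langle\left\langle -\frac{R}{2} g^2 f^2(T_a(v), u) - (Ru - v)\, g f(T_a(v), u)\, l \right\rangle\right\rangle, \qquad (16)$$

where $\alpha = m/N$ is the number of presented patterns per system size. In the present subsection we set $g = 1$. We now restrict ourselves to the following well-known algorithms: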

• Perceptron learning : $f = -S(u)\,\Theta(-T_a(v)S(u))$

• Hebbian learning : $f = T_a(v)$

• AdaTron learning : $f = -u\,\Theta(-T_a(v)S(u))$.
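The update rule of Eq. (14) with these three choices of $f$ can be simulated directly at finite $N$. The following is a minimal sketch under stated assumptions: the system size, the initial length of $\mathbf{J}$, the random seed and all function names are hypothetical, and the Hebbian rule is taken as $f = T_a(v)$ so that each update reinforces the teacher output.

```python
import math
import random

def teacher(v, a):
    """T_a(v) = sign[v (a - v)(a + v)]: +1 for v < -a or 0 < v < a."""
    return 1.0 if (v < -a or 0.0 < v < a) else -1.0

def sign(u):
    return 1.0 if u >= 0.0 else -1.0

def f_perceptron(T, u):
    return -sign(u) if T * sign(u) < 0.0 else 0.0

def f_hebbian(T, u):
    return T

def f_adatron(T, u):
    return -u if T * sign(u) < 0.0 else 0.0

def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))

def random_unit_vector(N, rng):
    x = [rng.gauss(0.0, 1.0) for _ in range(N)]
    norm = math.sqrt(dot(x, x))
    return [c / norm for c in x]

def simulate(f, a, N=100, alpha_max=20.0, g=1.0, seed=0):
    """Run J <- J + g f(T_a(v), u) x for alpha_max * N examples; return R."""
    rng = random.Random(seed)
    J0 = random_unit_vector(N, rng)  # teacher weights, |J0| = 1
    # Hypothetical initial student: random direction with l = |J|/sqrt(N) = 0.1.
    J = [0.1 * math.sqrt(N) * c for c in random_unit_vector(N, rng)]
    for _ in range(int(alpha_max * N)):
        x = random_unit_vector(N, rng)  # input on the unit sphere
        u = math.sqrt(N) * dot(J, x) / math.sqrt(dot(J, J))
        v = math.sqrt(N) * dot(J0, x)
        step = g * f(teacher(v, a), u)
        if step != 0.0:
            J = [Jc + step * xc for Jc, xc in zip(J, x)]
    return dot(J, J0) / math.sqrt(dot(J, J))

if __name__ == "__main__":
    for name, f in (("perceptron", f_perceptron),
                    ("Hebbian", f_hebbian),
                    ("AdaTron", f_adatron)):
        print(name, simulate(f, a=2.0))
```

At such a small $N$ the measured overlap fluctuates from run to run, so the sketch is only meant to show the mechanics of the three rules, not to reproduce the curves of the large-$N$ theory.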

For the above three learning strategies, the asymptotic forms of the generalization error for the learnable case are given as [2, 3]:

• Perceptron learning : $\epsilon_g \sim \alpha^{-1/3}$

• Hebbian learning : $\epsilon_g \sim \alpha^{-1/2}$

• AdaTron learning : $\epsilon_g \sim \alpha^{-1}$.

Fig. 6. Generalization errors of the AdaTron, perceptron and Hebbian learning algorithms for the case $a = 2.0$. AdaTron learning is the worst of the three algorithms.



On the other hand, for the unlearnable case, the generalization error converges exponentially to $a$-dependent non-zero values for both perceptron and AdaTron learning. Unfortunately, these residual errors are not necessarily the best possible value, as seen in Fig. 2. From this figure, we see that for the unlearnable case AdaTron learning is not superior to perceptron learning, although AdaTron learning is regarded as the most sophisticated learning algorithm for the learnable case [20]. In Fig. 6 we plot the generalization error of perceptron, Hebbian and AdaTron learning for the unlearnable case ($a = 2.0$).

For the Hebbian learning, the generalization error converges to $2H(a)$ for $a > a_{c1} = \sqrt{2\log 2}$ and to $1 - 2H(a)$ for $a < a_{c1}$ as $\alpha^{-1/2}$. For $a > a_{c1}$, this residual error $2H(a)$ corresponds to the optimal value. However, for $a < a_{c1}$, the generalization error of the Hebbian learning exceeds 0.5 and, in addition, overtraining is observed (Figs. 2, 3).
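The change of behavior at $a_{c1}$ can be seen by integrating the flow equations for $l$ and $R$ in the Hebbian case, where the averages are elementary. The sketch below assumes $f = T_a(v)$ and $g = 1$, for which $\langle\langle f^2\rangle\rangle = 1$, $\langle\langle f v\rangle\rangle = c(a) = \sqrt{2/\pi}\,(1 - 2e^{-a^2/2})$ and $\langle\langle f u\rangle\rangle = R\,c(a)$; the Euler step, the initial condition and the function names are hypothetical.

```python
import math

def H(x):
    """Gaussian tail probability H(x) = int_x^inf Dt."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def hebb_drift(a):
    """c(a) = <<T_a(v) v>>; it changes sign at a = sqrt(2 log 2)."""
    return math.sqrt(2.0 / math.pi) * (1.0 - 2.0 * math.exp(-0.5 * a * a))

def hebbian_flow(a, alpha_max=1000.0, d_alpha=0.01, R0=0.0, l0=1.0):
    """Euler integration of dl/dalpha and dR/dalpha for Hebbian learning."""
    c = hebb_drift(a)
    R, l = R0, l0
    for _ in range(int(alpha_max / d_alpha)):
        dl = (1.0 + 2.0 * R * c * l) / (2.0 * l)
        dR = (-0.5 * R + c * (1.0 - R * R) * l) / (l * l)
        l += d_alpha * dl
        R += d_alpha * dR
    return R, l

def residual_error(a):
    """Limiting error: E(1) = 2H(a) if c(a) > 0, else E(-1) = 1 - 2H(a)."""
    return 2.0 * H(a) if hebb_drift(a) > 0.0 else 1.0 - 2.0 * H(a)

if __name__ == "__main__":
    for a in (2.0, 0.5):
        R, l = hebbian_flow(a)
        print(a, R, residual_error(a))
```

The sign of $c(a)$ decides whether the student is driven toward $R = 1$ or $R = -1$, and $c(a) = 0$ reproduces the threshold $a_{c1} = \sqrt{2\log 2}$.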

This difficulty can be avoided partially by allowing the student to select suitable examples [21]. If the student uses only examples which lie on its decision boundary, that is, examples satisfying $u = 0$, the generalization error converges to the optimal value as $\alpha^{-1/2}$, except for $a_{c2} < a < a_{c1}$.



3.2 Optimization of learning rate 



We next regard the learning rate $g$ as a function of $\alpha$ and construct an algorithm by optimizing $g$. In order to determine the optimal rate $g_{\mathrm{opt}}$, we maximize the right-hand side of Eq. (16), that is, $dR/d\alpha$, with respect to $g$. This procedure is somewhat similar to the process of determining an annealing schedule. It differs from the method of Kinouchi and Caticha [22].
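Because $g$ does not depend on the individual example, the right-hand side of the $R$-equation is a quadratic function of $g$, and the maximization can be done in closed form. The sketch below illustrates this under the assumption that the quantity maximized is $dR/d\alpha = l^{-2}\,[-\tfrac{R}{2} g^2 \langle\langle f^2\rangle\rangle - g\,l\,\langle\langle (Ru - v) f\rangle\rangle]$ with $R > 0$; the function names and sample values are hypothetical.

```python
def dR_dalpha(g, R, l, avg_f2, avg_drift):
    """Rate of change of the overlap for a rate g taken outside the average.

    avg_f2    = <<f^2>>
    avg_drift = <<(R u - v) f>>
    """
    return (-0.5 * R * g * g * avg_f2 - g * l * avg_drift) / (l * l)

def optimal_rate(R, l, avg_f2, avg_drift):
    """Stationary point of the quadratic in g (a maximum when R > 0)."""
    return -l * avg_drift / (R * avg_f2)

if __name__ == "__main__":
    # Hypothetical values of the order parameters and averages.
    R, l, avg_f2, avg_drift = 0.5, 2.0, 1.0, -0.3
    g_opt = optimal_rate(R, l, avg_f2, avg_drift)
    print(g_opt, dR_dalpha(g_opt, R, l, avg_f2, avg_drift))
```

Inserting the rule-specific averages for the perceptron, Hebbian or AdaTron choice of $f$ then gives a rate that depends on $R$, $l$ and $a$ along the trajectory.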

We apply this technique to the perceptron, the Hebbian and the AdaTron learning algorithms. For the perceptron learning, this optimization procedure leads to the asymptotic form of the generalization error



na 



for the learnable case and to 



for the unlearnable case, where $2H(a)$ is the optimal value for $a > a_{c2}$. In the asymptotic region $\alpha\to\infty$, the learning rate behaves as $g_{\mathrm{opt}} \sim 1/\alpha$. This learning strategy thus seems to work well for $a > a_{c2}$. However, at $a = a_{c1}$, this optimization procedure fails to reach the best possible value of the generalization error, and the generalization error deteriorates to 0.5 (equal to the result of random guessing) [16]. The reason is that for $a = a_{c1}$ the optimal learning rate $g_{\mathrm{opt}}$ vanishes.

For the AdaTron learning, this type of optimization procedure gives the gen- 
eralization ability as 

^''^ 

for the learnable case and 



$$\epsilon_g = 2H(a) + \cdots \qquad (20)$$

for the unlearnable rule. Fortunately, for the AdaTron learning, the optimal learning rate does not vanish even at $a = a_{c1}$, and therefore this optimization procedure works effectively for $a > a_{c2}$ [17].

On the other hand, for the Hebbian learning, the above optimization procedure does not change the asymptotic form of the generalization error [16]. Nevertheless, if we introduce the optimal learning rate $g_{\mathrm{opt}}$ into the Hebbian learning with queries, we obtain a very fast convergence of the generalization error as

eg = 2H{a) + ^exp{--), (21) 

TT TT 

where c is a positive constant. 

The present optimization procedure does not work effectively for $a < a_{c2}$, because the key point of this method consists in pushing the student toward the state $R = 1$, and this state is not optimal for $a < a_{c2}$ (see Fig. 2).



4 Remarks 



In the present work, we have found that off-line learning attains the best possible value of the generalization error for the whole range of $a$. On the other hand, the conventional on-line learning algorithms need to be improved. We could improve the conventional on-line learning strategies by introducing the time-dependent optimal learning rate and queries. We could obtain the theoretical lower bound of the generalization error for the whole parameter range in the on-line mode. As our optimal learning rate contains the parameter $a$, which is unknown to the student, the result can be regarded only as a lower bound of the generalization error. However, if one uses the asymptotic form of $g_{\mathrm{opt}}$, a parameter-independent learning algorithm can be formulated, and the same generalization ability as in the parameter-dependent case can be obtained [16, 17].